Abstract: Inexpensive RGB-D cameras that give an RGB 
image together with depth data have become widely available. 
We use this data to build 3D point clouds of a full scene. In this 
paper, we address the task of labeling objects in this 3D point 
cloud of a complete indoor scene such as an office. We propose 
a graphical model that captures various features and contextual 
relations, including the local visual appearance and shape cues, 
object co-occurrence relationships and geometric relationships. 
With a large number of object classes and relations, the model's 
parsimony becomes important and we address that by using 
multiple types of edge potentials. The model admits efficient 
approximate inference, and we train it using a maximum-margin 
learning approach. In our experiments over a total of 52 3D 
scenes of homes and offices (composed from about 550 views, 
having 2495 segments labeled with 27 object classes), we get a 
performance of 84.06% in labeling 17 object classes for offices, 
and 73.38% in labeling 17 object classes for home scenes. Finally, 
we applied these algorithms successfully on a mobile robot for 
the task of finding an object in a large cluttered room. 

I. Introduction 

Inexpensive RGB-D sensors that augment an RGB image 
with depth data have recently become widely available. At 
the same time, years of research on SLAM (Simultaneous 
Localization and Mapping) now make it possible to merge 
multiple RGB-D images into a single point cloud, easily 
providing an approximate 3D model of a complete indoor 
scene (e.g., a room). In this paper, we explore how this move 
from part-of-scene 2D images to full-scene 3D point clouds 
can improve the richness of models for object labeling. 

In the past, a significant amount of work has been done in 
semantic labeling of 2D images. However, a lot of valuable 
information about the shape and geometric layout of objects 
is lost when a 2D image is formed from the corresponding 
3D world. A classifier that has access to a full 3D model can 
access important geometric properties in addition to the local 
shape and appearance of an object. For example, many objects 
occur in characteristic relative geometric configurations (e.g., 
a monitor is almost always on a table), and many objects 
consist of visually distinct parts that occur in a certain relative 
configuration. More generally, a 3D model makes it easy to 
reason about a variety of properties, which are based on 3D 
distances, volume and local convexity. 

In our work, we first use SLAM to compose multiple views from a
Microsoft Kinect RGB-D sensor into one 3D point cloud, providing
each RGB pixel with an absolute 3D location in the scene. We then
(over-)segment the scene and predict semantic labels for each
segment (see Fig. 1). We predict not only coarse classes as in
[1, 2] (i.e., wall, ground, ceiling, building), but also label
individual objects (e.g., printer, keyboard, mouse). Furthermore,
we model rich relational information beyond an associative
coupling of labels [2].

In this paper, we propose and evaluate the first model and 
learning algorithm for scene understanding that exploits rich 
relational information derived from the full-scene 3D point 
cloud for object labeling. In particular, we propose a graphical 
model that naturally captures the geometric relationships of a 
3D scene. Each 3D segment is associated with a node, and
pairwise potentials model the relationships between segments
(e.g., co-planarity, convexity, visual similarity, object
co-occurrences and proximity). The model admits efficient
approximate inference [3], and we show that it can be trained
using a maximum-margin approach [4, 5, 6] that globally minimizes
an upper bound on the training loss. We model both associative
and non-associative coupling of labels. With a large number
of object classes, the model's parsimony becomes important.
Some features are better indicators of label similarity, while
other features are better indicators of non-associative relations
such as geometric arrangement (e.g., "on top of," "in front of").
We therefore model them using appropriate clique potentials
rather than using general clique potentials. Our model is
highly flexible and we have made our software available for
download to other researchers in this emerging area of 3D
scene understanding.

To empirically evaluate our model and algorithms, we 
perform several experiments over a total of 52 scenes of two 
types: offices and homes. These scenes were built from about 
550 views from the Kinect sensor, and they will also be made 
available for public use. We consider labeling each segment 
(from a total of about 50 segments per scene) into 27 classes 
(17 for offices and 17 for homes, with 7 classes in common). 
Our experiments show that our method, which captures sev- 
eral local cues and contextual properties, achieves an overall 
performance of 84.06% on office scenes and 73.38% on home 
scenes. We also consider the problem of labeling 3D segments 
with multiple attributes meaningful in a robotics context (such as 
small objects that can be manipulated, furniture, etc.). Finally, 
we successfully applied these algorithms on a mobile robot 
for the task of finding an object in a large cluttered room. 

II. Related Work 
There is a huge body of work in the area of scene understanding
and object recognition from 2D images. Previous works focus on
several different aspects: designing good local features such as
HOG (histogram-of-gradients) [7] and bag of words [8], designing
good global (context) features such as GIST features [9], and
combining multiple tasks [10]. However, these approaches do not
consider the relative arrangement of the parts of the object or of
multiple objects with respect to each other. A number of works
propose models that explicitly capture the relations between
different parts of the object [11] and between different objects
in 2D images [12, 13]. However, a lot of valuable information
about the shape and geometric layout of objects is lost when a 2D
image is formed from the corresponding 3D world. In some recent
works, 3D layout or depths have been used for improving object
detection (e.g., [14, 15, 16, 17, 18]). Here a rough 3D scene
geometry (e.g., the main surfaces in a scene) is inferred from a
single 2D image or a stereo video, respectively. However, the
estimated geometry is not accurate enough to give significant
improvements. With 3D data, we can more precisely determine the
shape, size and geometric orientation of the objects, and several
other properties, and therefore capture much stronger context.

Fig. 1. Office scene (top) and home scene (bottom) with the
corresponding label coloring above the images. The left-most is the
original point cloud, the middle is the ground truth labeling and
the right-most is the point cloud with predicted labels.

The recent availability of synchronized videos of both color
and depth obtained from RGB-D (Kinect-style) depth cameras
shifted the focus to making use of both visual and shape
features for object detection [19, 20, 21, 22, 23] and
3D segmentation (e.g., [24]). These methods demonstrate
that augmenting visual features with 3D information can
enhance object detection in cluttered, real-world environments.
However, these works do not make use of the contextual 
relationships between various objects which have been shown 
to be useful for tasks such as object detection and scene 
understanding in 2D images. Our goal is to perform semantic 
labeling of indoor 3D scenes by modeling and learning several 
contextual relationships. 

There is also some recent work in labeling outdoor scenes
obtained from LIDAR data into a few geometric classes (e.g.,
ground, building, trees, vegetation, etc.). [25, 26] capture
context by designing node features and [27] do so by stacking
layers of classifiers; however, these methods do not model the
correlation between the labels. Some of these works model
some contextual relationships in the learning model itself. For
example, [2, 28] use associative Markov networks in order to
favor similar labels for nodes in the cliques. However, many
relative features between objects are not associative in nature.
For example, the relationship "on top of" does not hold between
two ground segments, i.e., a ground segment cannot be "on top of"
another ground segment. Therefore, using an associative Markov
network is very restrictive for our problem. All of these works
[25, 26, 27] were designed for outdoor scenes with LIDAR data
(without RGB values) and therefore would not apply directly to
RGB-D data in indoor environments. Furthermore, these methods
only consider very few geometric classes (between three and five
classes) in outdoor environments, whereas we consider a large
number of object classes for labeling the indoor RGB-D data.

The work most closely related to ours is [1], which labels the 
planar patches in a point-cloud of an indoor scene with four 
geometric labels (walls, floors, ceilings, clutter). They use a 
CRF to model geometrical relationships such as orthogonal, 
parallel, adjacent, and coplanar. The learning method for esti- 
mating the parameters was based on maximizing the pseudo- 
likelihood resulting in a sub-optimal learning algorithm. In 
comparison, our basic representation is 3D segments (as 
compared to planar patches) and we consider a much larger 
number of classes (beyond just the geometric classes). We 
capture a much richer set of relationships between pairs of 
objects, and use a principled max-margin learning method to 
learn the parameters of our model. 

III. Approach 
We now outline our approach, including the model, its 
inference methods, and the learning algorithm. Our input is 
multiple Kinect RGB-D images of an indoor scene stitched 
into a single 3D point cloud using RGBDSLAM [29]. Each 
such point cloud is then over-segmented based on smoothness 
(i.e., difference in the local surface normals) and continuity 
of surfaces (i.e., distance between the points). These segments 
are the atomic units in our model. Our goal is to label each 
of them. 

Before getting into the technical details of the model, we
outline the properties we aim to capture:
Visual appearance. The reasonable success of object detection
in 2D images shows that visual appearance is a good
indicator for labeling scenes. We therefore model the local
color, texture, gradients of intensities, etc. for predicting the
labels. In addition, we also model the property that if nearby
segments are similar in visual appearance, they are more likely
to belong to the same object.

Local shape and geometry. Objects have characteristic
shapes: for example, a table is horizontal, a monitor is
vertical, a keyboard is uneven, and a sofa is usually smoothly
curved. Furthermore, parts of an object often form a convex
shape. We compute 3D shape features to capture this.

Geometrical context. Many sets of objects occur in characteristic
relative geometric configurations. For example, a monitor
is almost always on top of a table, chairs are usually found near
tables, and a keyboard is in front of a monitor. This means that
our model needs to capture non-associative relationships (i.e.,
that neighboring segments differ in their labels in specific
patterns).

Note that the examples given above are just illustrative. For 
any particular practical application, there will likely be other 
properties that could also be included. As demonstrated in the 
following section, our model is flexible enough to include a 
wide range of features. 
A. Model Formulation 

We model the three-dimensional structure of a scene using a
model isomorphic to a Markov Random Field with log-linear
node and pairwise edge potentials. Given a segmented point
cloud $\mathbf{x} = (x_1, \ldots, x_N)$ consisting of segments
$x_i$, we aim to predict a labeling $\mathbf{y} = (y_1, \ldots, y_N)$
for the segments. Each segment label $y_i$ is itself a vector of
$K$ binary class labels $y_i = (y_i^1, \ldots, y_i^K)$, with each
$y_i^k \in \{0, 1\}$ indicating whether a segment $i$ is a member
of class $k$. Note that multiple $y_i^k$ can be 1 for each segment
(e.g., a segment can be both a "chair" and a "movable object").
We use such multi-labelings in our attribute experiments where
each segment can have multiple attributes, but not in segment
labeling experiments where each segment can have only one label.

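For a segmented point cloud $\mathbf{x}$, the prediction
$\hat{\mathbf{y}}$ is computed as the argmax of a discriminant
function $f_{\mathbf{w}}(\mathbf{x}, \mathbf{y})$ that is
parameterized by a vector of weights $\mathbf{w}$.

$$\hat{\mathbf{y}} = \operatorname*{argmax}_{\mathbf{y}} f_{\mathbf{w}}(\mathbf{x}, \mathbf{y}) \tag{1}$$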



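The discriminant function captures the dependencies between
segment labels as defined by an undirected graph
$(\mathcal{V}, \mathcal{E})$ of vertices
$\mathcal{V} = \{1, \ldots, N\}$ and edges
$\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. We
describe in Section III-B how this graph is derived from the
spatial proximity of the segments. Given $(\mathcal{V}, \mathcal{E})$,
we define the following discriminant function based on individual
segment features $\phi_n(i)$ and edge features $\phi_t(i,j)$ as
further described below.

$$f_{\mathbf{w}}(\mathbf{x}, \mathbf{y}) = \sum_{i \in \mathcal{V}} \sum_{k=1}^{K} y_i^k \left[ w_n^k \cdot \phi_n(i) \right] + \sum_{(i,j) \in \mathcal{E}} \sum_{T_t \in \mathcal{T}} \sum_{(l,k) \in T_t} y_i^l \, y_j^k \left[ w_t^{lk} \cdot \phi_t(i,j) \right] \tag{2}$$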



The node feature map $\phi_n(i)$ describes segment $i$ through a
vector of features, and there is one weight vector $w_n^k$ for each
of the $K$ classes. Examples of such features are the ones
capturing local visual appearance, shape and geometry. The
edge feature maps $\phi_t(i,j)$ describe the relationship between
segments $i$ and $j$. Examples of edge features are the ones
capturing similarity in visual appearance and geometric
context.^1 There may be multiple types $t$ of edge feature maps
$\phi_t(i,j)$, and each type has a graph over the $K$ classes with
edges $T_t$. If $T_t$ contains an edge between classes $l$ and $k$,
then this feature map and a weight vector $w_t^{lk}$ are used to
model the dependencies between classes $l$ and $k$. If the edge is
not present in $T_t$, then $\phi_t(i,j)$ is not used.

^1 Even though it is not represented in the notation, note that
both the node feature map $\phi_n(i)$ and the edge feature maps
$\phi_t(i,j)$ can compute their features based on the full
$\mathbf{x}$, not just $x_i$ and $x_j$.

We say that a type $t$ of edge features is modeled by an
associative edge potential if
$T_t = \{(k,k) \mid \forall k = 1..K\}$.
It is modeled by a non-associative edge potential if
$T_t = \{(l,k) \mid \forall l,k = 1..K\}$. Finally, it
is modeled by an object-associative edge potential if
$T_t = \{(l,k) \mid \exists\, \mathrm{object},\ l,k \in \mathrm{parts}(\mathrm{object})\}$.
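To make the roles of the class-pair sets $T_t$ concrete, the following is a minimal Python sketch that builds the three kinds of sets defined above and evaluates a discriminant function with node and pairwise terms for a given labeling. All function names, the toy weights and the toy features are hypothetical and chosen only for illustration; this is not the released software.

```python
from itertools import product


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def associative_pairs(K):
    """T_t = {(k, k)}: only identical classes interact."""
    return {(k, k) for k in range(K)}


def non_associative_pairs(K):
    """T_t = {(l, k)} for all l, k: every ordered class pair interacts."""
    return set(product(range(K), repeat=2))


def object_associative_pairs(objects):
    """T_t = {(l, k)} such that l and k are parts of the same object.

    `objects` is a list of lists of class indices (one list per object).
    """
    pairs = set()
    for parts in objects:
        pairs.update(product(parts, repeat=2))
    return pairs


def discriminant(y, node_feats, edges, edge_feats, w_node, w_edge, T):
    """Score a labeling y, where y[i][k] is 0 or 1.

    node_feats[i]       -> feature vector of segment i
    edge_feats[t][i, j] -> feature vector of type t for edge (i, j)
    w_node[k]           -> node weight vector for class k
    w_edge[t][l, k]     -> edge weight vector of type t for classes (l, k)
    T[t]                -> set of class pairs modeled by type t
    """
    score = 0.0
    for i, phi in enumerate(node_feats):
        for k, w in enumerate(w_node):
            score += y[i][k] * dot(w, phi)
    for (i, j) in edges:
        for t, pairs in T.items():
            phi = edge_feats[t][(i, j)]
            for (l, k) in pairs:
                score += y[i][l] * y[j][k] * dot(w_edge[t][(l, k)], phi)
    return score


# Toy problem: two segments, two classes, one non-associative edge type.
K = 2
T = {"na": non_associative_pairs(K)}
w_edge = {"na": {pair: [0.0] for pair in T["na"]}}
w_edge["na"][(0, 1)] = [3.0]  # reward class 0 next to class 1
score = discriminant(
    y=[[1, 0], [0, 1]],
    node_feats=[[1.0], [2.0]],
    edges=[(0, 1)],
    edge_feats={"na": {(0, 1): [1.0]}},
    w_node=[[1.0], [-1.0]],
    w_edge=w_edge,
    T=T,
)
print(score)
```

In this toy case the node terms contribute 1.0 and -2.0 and the single active pairwise term contributes 3.0. Swapping `non_associative_pairs` for `associative_pairs` would drop the (0, 1) term entirely, which is the restriction the text attributes to purely associative models.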

Parsimonious model. In our experiments we distinguished
between two types of edge feature maps: "object-associative"
features $\phi_{oa}(i,j)$ used between classes that are parts of the
same object (e.g., "chair base", "chair back" and "chair back
rest"), and "non-associative" features $\phi_{na}(i,j)$ that are used
between any pair of classes. Examples of features in the
object-associative feature map $\phi_{oa}(i,j)$ include similarity in
appearance, co-planarity, and convexity, i.e., features that
indicate whether two adjacent segments belong to the same
class or object. A key reason for distinguishing between
object-associative and non-associative features is parsimony
of the model. In this parsimonious model (referred to as
svm_mrf_parsimon), we model object-associative features
using object-associative edge potentials and non-associative
features using non-associative edge potentials. As not all edge
features are "non-associative", we avoid learning weight vectors
for relationships which do not exist. Note that
$|T_{na}| \gg |T_{oa}|$ since, in practice, the number of parts of
an object is much less than $K$. Due to this, the model we learn
with both types of edge features has far fewer parameters than
a model learned with all edge features treated as
"non-associative features".
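The size difference between the two class-pair sets can be illustrated with a short count. The sketch below uses the 17 office labels listed in Section IV-A; grouping them into objects by name prefix, and treating every stand-alone label as a one-part object, are assumptions made here for illustration and are not stated in the paper.

```python
from itertools import product

# Office labels from Section IV-A, grouped into objects by name prefix.
# The grouping itself is an assumption made for this illustration.
OFFICE_OBJECTS = {
    "wall": ["wall"],
    "floor": ["floor"],
    "table": ["tableTop", "tableDrawer", "tableLeg"],
    "chair": ["chairBackRest", "chairBase", "chairBack"],
    "monitor": ["monitor"],
    "printer": ["printerFront", "printerSide"],
    "keyboard": ["keyboard"],
    "cpu": ["cpuTop", "cpuFront", "cpuSide"],
    "book": ["book"],
    "paper": ["paper"],
}


def object_associative_pairs(objects):
    """All ordered class pairs (l, k) whose classes belong to one object."""
    pairs = set()
    for parts in objects.values():
        pairs.update(product(parts, repeat=2))
    return pairs


classes = [c for parts in OFFICE_OBJECTS.values() for c in parts]
K = len(classes)
n_non_associative = K * K  # |T_na|: every ordered pair of classes
n_object_associative = len(object_associative_pairs(OFFICE_OBJECTS))  # |T_oa|

# Each edge feature needs one weight per class pair in its set T_t.
print(K, n_non_associative, n_object_associative)  # 17 289 37
```

Under this grouping, each object-associative edge feature needs 37 weights instead of 289, which is the kind of saving the parsimonious model relies on.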

B. Features 

Table I summarizes the features used in our experiments.
$\lambda_{i0}$, $\lambda_{i1}$ and $\lambda_{i2}$ are the 3
eigenvalues of the scatter matrix computed from the points of
segment $i$ in increasing order. $c_i$ is the centroid of segment
$i$. $r_i$ is the ray vector to the centroid $c_i$ from the camera
in which it was captured. $r_{hi}$ is the projection of $r_i$ on
the horizontal plane. $\hat{n}_i$ is the unit normal of segment $i$
which points towards the camera ($r_i \cdot \hat{n}_i < 0$).

The node features $\phi_n(i)$ consist of visual appearance
features based on histogram of HSV values and the histogram of
gradients (HOG), as well as local shape and geometry features
that capture properties such as how planar a segment is, its
absolute location above ground, and its shape. Some features
capture spatial location of an object in the scene (e.g., N9).

We connect two segments (nodes) $i$ and $j$ by an edge if
there exists a point in segment $i$ and a point in segment $j$
which are less than context_range distance apart. This captures
the closest distance between two segments (as compared to
centroid distance between the segments); we study the effect
of context range more in Section IV-B.

The edge features $\phi_t(i,j)$ (Table I-right) consist of
associative features (E1-E2) based on visual appearance and
local shape, as well as non-associative


TABLE I

Node features for segment i.

Description                                                  Count
Visual Appearance                                            48
  N1. Histogram of HSV color values                          14
  N2. Average HSV color values                               3
  N3. Average of HOG features of the blocks in image
      spanned by the points of a segment                     31
Local Shape and Geometry
  N4. Linearness ($\lambda_{i0} - \lambda_{i1}$), planarness ($\lambda_{i1} - \lambda_{i2}$)
  N5. Scatter: $\lambda_{i0}$
  N6. Vertical component of the normal: $\hat{n}_{iz}$
  N7. Vertical position of centroid: $c_{iz}$
  N8. Vert. and Hor. extent of bounding box
  N9. Dist. from the scene boundary

Features for edge (segment i, segment j).

Description
Visual Appearance (associative)
  E1. Difference of avg HSV color values
Local Shape and Geometry (associative)
  E2. Coplanarity and convexity
Geometric context (non-associative)
  E3. Horizontal distance b/w centroids
  E4. Vertical displacement b/w centroids: $(c_{iz} - c_{jz})$
  E5. Angle between normals (dot product): $\hat{n}_i \cdot \hat{n}_j$
  E6. Diff. in angle with vert.: $\cos^{-1}(\hat{n}_{iz}) - \cos^{-1}(\hat{n}_{jz})$
  E7. Dist. between closest points
  E8. Rel. position from camera (in front of/behind)


features (E3-E8) that capture the tendencies of two objects 
to occur in certain configurations. Note that our features are 
insensitive to horizontal translation and rotation of the camera. 
However, our features place a lot of emphasis on the vertical 
direction because gravity influences the shape and relative 
positions of objects to a large extent. 
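The neighborhood rule and a few of the geometric edge features from Table I can be sketched as follows. This is a brute-force illustration under stated assumptions: the z axis is taken as the vertical direction, segments are plain lists of (x, y, z) tuples, and all function and key names are hypothetical. With about a million points per scene, a spatial index such as a k-d tree would be needed in practice.

```python
import math


def closest_distance(points_a, points_b):
    """Smallest Euclidean distance between any point of A and any point of B."""
    return min(math.dist(p, q) for p in points_a for q in points_b)


def build_edges(segments, context_range):
    """Connect segments i < j whose closest points are under context_range apart."""
    edges = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if closest_distance(segments[i], segments[j]) < context_range:
                edges.append((i, j))
    return edges


def centroid(points):
    """Mean of the points along each of the three axes."""
    n = len(points)
    return tuple(sum(p[axis] for p in points) / n for axis in range(3))


def geometric_edge_features(seg_i, seg_j):
    """E3, E4 and E7 from Table I, assuming z is the vertical axis."""
    ci, cj = centroid(seg_i), centroid(seg_j)
    return {
        "E3_horizontal_distance": math.hypot(ci[0] - cj[0], ci[1] - cj[1]),
        "E4_vertical_displacement": ci[2] - cj[2],
        "E7_closest_distance": closest_distance(seg_i, seg_j),
    }


# Toy scene (coordinates in meters): a table top, a monitor on it, a far sofa.
segments = [
    [(0.0, 0.0, 0.7), (0.5, 0.0, 0.7), (1.0, 0.0, 0.7)],  # table top
    [(0.5, 0.0, 0.75), (0.5, 0.0, 1.15)],                 # monitor
    [(5.0, 5.0, 0.4)],                                    # sofa
]
edges = build_edges(segments, context_range=0.3)
feats = geometric_edge_features(segments[0], segments[1])
print(edges, feats)
```

The sketch returns each edge once as an unordered pair. Because features such as E4 change sign when the two segments are swapped, an implementation that scores ordered class pairs would have to decide how to handle both directions; that choice is left open here.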
C. Learning and Inference 

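Solving the argmax in Eq. (1) for the discriminant function
in Eq. (2) is NP-hard. It can be formulated as the following
mixed-integer program, which can be solved by a general-purpose
MIP solver^2 in about 20 minutes on a typical scene.

$$\hat{\mathbf{y}} = \operatorname*{argmax}_{\mathbf{y}} \max_{\mathbf{z}} \sum_{i \in \mathcal{V}} \sum_{k=1}^{K} y_i^k \left[ w_n^k \cdot \phi_n(i) \right] + \sum_{(i,j) \in \mathcal{E}} \sum_{T_t \in \mathcal{T}} \sum_{(l,k) \in T_t} z_{ij}^{lk} \left[ w_t^{lk} \cdot \phi_t(i,j) \right] \tag{3}$$

$$\forall i,j,l,k:\quad z_{ij}^{lk} \le y_i^l, \quad z_{ij}^{lk} \le y_j^k, \quad y_i^l + y_j^k \le z_{ij}^{lk} + 1, \quad z_{ij}^{lk},\, y_i^l \in \{0,1\}, \qquad \forall i: \sum_{j=1}^{K} y_i^j = 1 \tag{4}$$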



However, if we remove the last constraint (4), and relax
the variables $z_{ij}^{lk}$ and $y_i^l$ to the interval [0,1],
we get a linear relaxation that can be shown to always have
half-integral solutions (i.e., $y_i^l$ only take values
{0, 0.5, 1} at the solution) [30]. Furthermore, this relaxation
can also be solved as a quadratic pseudo-Boolean optimization
problem using a graph-cut method [3], which is orders of magnitude
faster than using a general purpose LP solver (i.e., 2 sec for
labeling a typical full scene in our experiments, and 0.2 sec for
a single view). For training, we use the SVM-struct software^3,
which uses the cutting plane method to jointly learn values of
$w_n$ and the $w_t$'s so as to minimize a regularized upper bound
on the training error.
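For intuition about the objective in Eq. (3), the sketch below finds the best one-label-per-segment assignment by exhaustive search on a toy problem, and checks that the three linking constraints on $z_{ij}^{lk}$ force it to equal the product $y_i^l y_j^k$ for binary values. It is not the MIP solver or the graph-cut method used in the paper: enumeration costs $K^N$ evaluations and is only feasible for tiny examples. The precomputed score tables and all names are hypothetical.

```python
from itertools import product


def brute_force_map(node_scores, edge_scores):
    """Exhaustively maximize node plus pairwise scores, one label per segment.

    node_scores[i][k]         -> precomputed w_n^k . phi_n(i)
    edge_scores[(i, j)][l, k] -> precomputed sum over t of w_t^{lk} . phi_t(i, j)
                                 (class pairs that are absent score 0)
    """
    n_segments = len(node_scores)
    n_classes = len(node_scores[0])
    best_labels, best_score = None, float("-inf")
    for labels in product(range(n_classes), repeat=n_segments):
        s = sum(node_scores[i][labels[i]] for i in range(n_segments))
        for (i, j), table in edge_scores.items():
            s += table.get((labels[i], labels[j]), 0.0)
        if s > best_score:
            best_labels, best_score = labels, s
    return best_labels, best_score


def linearization_is_exact():
    """For binary y_i, y_j, check that z <= y_i, z <= y_j and
    y_i + y_j <= z + 1 leave z = y_i * y_j as the only feasible value."""
    for yi, yj in product((0, 1), repeat=2):
        feasible = [z for z in (0, 1) if z <= yi and z <= yj and yi + yj <= z + 1]
        if feasible != [yi * yj]:
            return False
    return True


# Toy example with classes 0 = tableTop, 1 = monitor and two segments.
# Node scores alone would label both segments as class 0.
node_scores = [[1.0, 0.2], [1.0, 0.9]]
# Context: (tableTop, monitor) is rewarded, (tableTop, tableTop) is penalized.
edge_scores = {(0, 1): {(0, 1): 0.5, (0, 0): -0.5}}
labels, score = brute_force_map(node_scores, edge_scores)
print(labels, score)
```

Here the pairwise term changes the labeling of the second segment from class 0 to class 1, a small instance of the non-associative context the model is designed to capture.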

IV. Experiments 

A. Data 

We consider labeling object segments in full 3D scenes
(as compared to 2.5D data from a single view). For this
purpose, we collected data of 24 office and 28 home scenes.
Each scene was reconstructed from about 8-9 RGB-D views
from a Microsoft Kinect sensor and we have a total of about
550 views. Each scene contains about a million colored
points. We first over-segment the 3D scene (as described
earlier) to obtain the atomic units of our representation.
For training, we manually labeled the segments, and we
selected the labels which were present in a minimum of
5 scenes in the dataset. Specifically, the office labels are:
{wall, floor, tableTop, tableDrawer, tableLeg, chairBackRest,
chairBase, chairBack, monitor, printerFront, printerSide,
keyboard, cpuTop, cpuFront, cpuSide, book, paper}, and
the home labels are: {wall, floor, tableTop, tableDrawer,
tableLeg, chairBackRest, chairBase, sofaBase, sofaArm,
sofaBackRest, bed, bedSide, quilt, pillow, shelfRack, laptop,
book}. This gave us a total of 1108 labeled segments in the
office scenes and 1387 segments in the home scenes. Often
one object may be divided into multiple segments because
of over-segmentation. We have made this data available at:
http://pr.cs.cornell.edu/sceneunderstanding.

^2 http://www.tfinley.net/software/pyglpk/readme.html
^3 http://svmlight.joachims.org/svm_struct.html

B. Results 

Table II shows the results, performed using 4-fold cross-validation
and averaging performance across the folds for the
models trained separately on home and office datasets. We use
both macro and micro averaging to aggregate precision
and recall over various classes. Since our algorithm can only
predict one label per segment, micro precision and recall are
the same as the percentage of correctly classified segments. Macro
precision and recall are respectively the averages of precision
and recall for all classes. The optimal C value is determined
separately for each of the algorithms by cross-validation.
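The two averaging schemes can be sketched as below. The function name and the toy labels are hypothetical. Two conventions are assumptions of this sketch rather than statements from the paper: a class that is never predicted contributes a precision of 0 to the macro average, and a prediction of `None` stands for a segment that received no label.

```python
from collections import Counter


def micro_macro(y_true, y_pred):
    """Micro and macro precision/recall for single-label predictions.

    Entries of y_pred may be None for segments left unlabeled.
    """
    classes = sorted(set(y_true))
    true_count = Counter(y_true)
    pred_count = Counter()
    true_pos = Counter()
    for t, p in zip(y_true, y_pred):
        if p is None:
            continue
        pred_count[p] += 1
        if p == t:
            true_pos[t] += 1

    correct = sum(true_pos.values())
    n_predicted = sum(pred_count.values())
    per_class_precision = [
        true_pos[c] / pred_count[c] if pred_count[c] else 0.0 for c in classes
    ]
    per_class_recall = [true_pos[c] / true_count[c] for c in classes]
    return {
        "micro_precision": correct / n_predicted if n_predicted else 0.0,
        "micro_recall": correct / len(y_true),
        "macro_precision": sum(per_class_precision) / len(classes),
        "macro_recall": sum(per_class_recall) / len(classes),
    }


y_true = ["wall", "wall", "floor", "monitor"]
# Every segment gets exactly one label: micro precision equals micro recall.
scores = micro_macro(y_true, ["wall", "floor", "floor", "monitor"])
# One segment left unlabeled: micro precision and recall now differ.
scores_partial = micro_macro(y_true, ["wall", None, "floor", "monitor"])
print(scores)
print(scores_partial)
```

When every segment receives exactly one label, the number of predictions equals the number of segments, so both micro values reduce to plain accuracy, as the text states.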

Figure 1 shows the original point cloud, ground-truth and
predicted labels for one office (top) and one home scene
(bottom). We see that on the majority of the classes our model
predicts the correct label. It makes mistakes on some tricky
cases, such as a pillow getting confused with the bed, and
table-top getting confused with the shelf-rack.

One of our goals is to study the effect of various factors, 
and therefore we compared various versions of the algorithms 
with various settings. We discuss them in the following. 
Do Image and Point-Cloud Features Capture Complementary
Information? The RGB-D data contains both image
and depth information, and enables us to compute a wide
variety of features. In this experiment, we compare the two
kinds of features: Image (RGB) and Shape (Point Cloud)
features. To show the effect of the features independent of
the effect of context, we only use the node potentials from
our model, referred to as svm_node_only in Table II. The
svm_node_only model is equivalent to the multi-class SVM
formulation [31]. Table II shows that Shape features are more
effective than Image features, and the combination works
better on both precision and recall. This indicates that the two
types of features offer complementary information and their
combination is better for our classification task.
How Important is Context? Using our svm_mrf_parsimon
model as described in Section III-A, we show significant
improvements in performance over the svm_node_only
model on both datasets. In office scenes, the micro precision
increased by 6.09% over the best svm_node_only model, which
does not use any context. In home scenes the increase is much
higher, 16.88%.

TABLE II
Average micro precision/recall, average macro precision and recall for home and office scenes.

The type of contextual relations we capture depends on
the type of edge potentials we model. To study this, we
compared our method with models using only associative
(svm_mrf_assoc) or only non-associative (svm_mrf_nonassoc)
edge potentials. We observed that modeling all edge features
using associative potentials performs poorly compared to our full
model. In fact, using only associative potentials showed a
drop in performance compared to the svm_node_only model on
the office dataset. This indicates that it is important to capture
the relations between regions having different labels. Our
svm_mrf_nonassoc model does so by modeling all edge
features using non-associative potentials, which can favour or
disfavour labels of different classes for nearby segments. It
gives higher precision and recall compared to svm_node_only
and svm_mrf_assoc.

However, not all the edge features are non-associative
in nature, so modeling them using only non-associative
potentials could be overkill (each non-associative feature
adds $K^2$ more parameters to be learned). Therefore, using our
svm_mrf_parsimon model to model these relations achieves
higher performance in both datasets.

How Large should the Context Range be? Context relationships
of different objects can be meaningful for different spatial
distances. This range may vary depending on the environment as
well. For example, in an office, keyboard and monitor go together,
but they may have little relation with a sofa that is slightly
farther away. In a house, sofa and table may go together even if
they are farther away.

In order to study this, we compared our svm_mrf_parsimon
with varying context range for determining the neighborhood
(see Figure 2 for average micro precision vs range plot). Note
that the context range is determined from the boundary of one
segment to the boundary of the other, and hence it is somewhat
independent of the size of the object. We note that increasing
the context range increases the performance to some level,
and then it drops slightly. We attribute this to the fact that
with increasing the context range, irrelevant objects may get
an edge in the graph, and with limited training data, spurious
relationships may be learned. We observe that the optimal
context range is around 0.3 meters for office scenes and 0.6
meters for home scenes.

Fig. 2. Effect of context range on precision (=recall here).
Plotted series: Office Micro Precision and Home Micro Precision;
x-axis: Context Range, from 0.1 to 0.9.

How does a Full 3D Model Compare to a 2.5D Model? In
Table II we compare the performance of our full model with
a model that was trained and tested on single views of the
same scene. During the comparison, the training folds were
consistent with other experiments; however, the segmentation
of this point cloud was different (because the input point cloud
itself is from a single view). This makes the micro precision
values not meaningful because the distribution of labels is not
the same for the two cases. In particular, many large objects in
scenes (e.g., wall, ground) get split up into multiple segments
in single views. We observed that the macro precision and
recall are higher when multiple views are combined to form
the scene. We attribute the improvement in macro precision
and recall to the fact that larger scenes have more context,
and models are more complete because of multiple views.
What is the Effect of the Inference Method? The results
for the svm_mrf algorithms in Table II were generated using the
MIP solver. The graph-cut algorithm, however, gives a higher
precision and lower recall on both datasets. For example, on
office data, the graph-cut inference for our svm_mrf_parsimon
gave a micro precision of 90.25 and micro recall of 61.74.
Here, the micro precision and recall are not the same, as some of
the segments might not get any label. Since the graph-cut
algorithm is orders of magnitude faster, it is ideal for real-time
robotic applications.
C. Robotic experiments 

The ability to label segments is very useful for robotics
applications, for example, in detecting objects (so that a robot
can find/retrieve an object on request) or for other robotic tasks
such as manipulation. We therefore performed two relevant
robotic experiments.

Attribute Learning: In some robotic tasks, such as robotic
grasping [32] or placing [33], it is not important to know
the exact object category, but just knowing a few attributes
of an object may be useful. For example, if a robot has to
clean a floor, it would help if it knows which objects it can
move and which it cannot. If it has to place an object, it
should place it on horizontal surfaces, preferably where
humans do not sit. With this motivation we have designed 8
attributes each for the home and office scenes, giving a total of
10 unique attributes: wall, floor, flat-horizontal-surfaces,
furniture, fabric, heavy, seating-areas, small-objects,
table-top-objects, electronics. Note that each segment in the
point cloud can have multiple attributes and therefore we can
learn these attributes using our model, which naturally allows
multiple labels per segment. We compute the precision and
recall over the attributes by counting how many attributes were
correctly inferred. In home scenes we obtained a precision of
83.12% and 70.03% recall, and in the office scenes we obtained
87.92% precision and 71.93% recall.

Fig. 3. Cornell's POLAR (PersOnaL Assistant Robot) using our
classifier for detecting a keyboard in a cluttered room.
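Counting correctly inferred attributes can be sketched as follows. Pooling the counts over all segments before dividing is an assumption of this sketch, since the text does not spell out the aggregation; the function name and the toy attribute sets are hypothetical.

```python
def attribute_precision_recall(true_sets, pred_sets):
    """Precision and recall over attributes, pooled across segments.

    true_sets[i] and pred_sets[i] are the sets of attributes that segment i
    actually has and that were predicted for it, respectively.
    """
    correct = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    predicted = sum(len(p) for p in pred_sets)
    actual = sum(len(t) for t in true_sets)
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    return precision, recall


# Two toy segments: a chair part and a keyboard.
true_sets = [
    {"furniture", "seating-areas"},
    {"small-objects", "table-top-objects", "electronics"},
]
pred_sets = [
    {"furniture"},
    {"small-objects", "electronics", "heavy"},
]
precision, recall = attribute_precision_recall(true_sets, pred_sets)
print(precision, recall)
```

In the toy case three of the four predicted attributes are correct, and three of the five true attributes are recovered, so precision is higher than recall, the same direction as in the reported results.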

Robotic Object Detection: We finally use our algorithm
on a mobile robot, mounted with a Kinect, for completing
the goal of finding an object such as a keyboard in
an extremely cluttered room (Fig. 3). The following video
shows our robot successfully finding the keyboard in an office:
http://pr.cs.cornell.edu/sceneunderstanding

In conclusion, we have proposed and evaluated the first
model and learning algorithm for scene understanding that
exploits rich relational information from full-scene 3D point
clouds. We applied this technique to the object labeling problem,
and studied the effects of various factors on a large dataset.
Our robotic applications show that such inexpensive RGB-D
sensors can be quite useful for scene understanding by robots.